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Emergence of foundation models 
and generative Al (gen Al) has 
introduced a new era for building 
Al systems. 


Introduction 


The emergence of foundation models and generative Al (gen Al) has introduced a new era 
for building Al systems. Selecting the right model from a diverse range of architectures 

and sizes, curating data, engineering optimal prompts, tuning models for specific tasks, 
grounding model outputs in real-world data, optimizing hardware - these are just a few of the 
novel challenges that large models introduce. 


This whitepaper delves into the fundamental tenets of MLOps and the necessary adaptations 
required for the domain of gen Al and Foundation Models. We also examine the diverse range 
of Vertex Al products, specifically tailored to address the unique demands of foundation 
models and gen Al-based applications. Through this exploration we uncover how Vertex Al, 
with its solid foundations of Al infrastructure and MLOps tools, expands its capabilities to 
provide a comprehensive MLOps platform for gen Al. 
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What are DevOps and MLOps? 


DevOps is a software engineering methodology that aims to bridge the gap between 
development (Dev) and operations (Ops). It promotes collaboration, automation, and 
continuous improvement to streamline the software development lifecycle, introducing 
practices such as continuous integration and continuous delivery. 


MLOps builds upon DevOps principles to address the unique challenges of operationalizing 
Machine Learning systems rapidly and reliably. In particular, MLOps tackles the experimental 
nature of ML through practices like: 


* Data validation: Ensuring the quality and integrity of training data. 
* Model evaluation: Rigorously assessing model performance with appropriate metrics. 
* Model monitoring: Tracking model behavior in production to detect and mitigate drift. 


* Tracking & reproducibility: Maintaining meticulous records for experiment tracking and 
result reproduction. 


Machine learning workflow 


Model monitoring Model training 


Model deployment Model evaluation 


and serving and iteration 


Oia ae 


Figure 1. Machine learning workflow 
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Lifecycle of a gen Al system 


Imagine deploying your first chatbot after months of dedicated work, and it's now interacting 
with users and answering questions. Behind this seemingly simple interaction lies the 
complex and fascinating life cycle of a gen Al System, which can be broken down into five 
key moments. 


First in the discovery phase, developers and Al engineers must navigate the expanding 
landscape of available models to identify the most suitable one for their specific gen Al 
application. They must consider each model's strengths, weaknesses, and costs to make an 
informed decision. 


Next, development and experimentation become paramount, with prompt engineering 
playing a crucial role in crafting and refining input prompts to elicit desired outputs based on 
an understanding of the model's intricacies. Few-shot learning, where examples are provided, 
can further guide model behavior, while additional customization may involve parameter- 
efficient fine-tuning (PEFT). Most gen Al systems also involve model chaining, which refers to 
orchestrating calls to multiple models in a specific sequence to create a workflow. 


Data engineering practices have a critical role across all development stages, with factual 
grounding (ensuring the model's outputs are based on accurate, up-to-date information) and 
recent data from internal and enterprise systems being essential for reliable outputs. Tuning 
data is often needed to adapt models to specific tasks, styles, or to rectify persistent errors. 


Deployment needs to manage many new artifacts in the deployment process, including 
prompt templates, chain definitions, embedding models, retrieval data stores, and fine-tuned 
model adapters among others. These artifacts each have unique governance requirements, 
necessitating careful management throughout development and deployment. Gen Al system 
deployment also needs to account for the technical capabilities of the target infrastructure, 
ensuring that system hardware requirements are fulfilled. 
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Continuous monitoring in production ensures improved application performance and 
maintains safety standards through responsible Al techniques, such as ensuring fairness, 
transparency, and accountability in the model's outputs. 


Continuous Improvement as a concept is still key for Gen Al-based applications, though 
with a twist. For most Gen Al applications, instead of training models from scratch, we’re 
taking foundation models (FMs) and then adapting them to our specific use case. This means 
constantly tweaking these FMs through prompting techniques, swapping them out for newer 
versions, or even combining multiple models for enhanced performance, cost efficiency, or 
reduced latency. Traditional continuous training still holds relevance for scenarios when 
recurrent fine-tuning or incorporating human feedback loops are still needed. 


Naturally, this lifecycle assumes that the foundational model powering the gen Al system is 
already operationalized. It's important to recognize that not all organizations will be directly 
involved in this part of the process. In particular, the operationalization of foundational 
models is a specialized set of tasks that is typically only relevant for a select few companies 
with the necessary resources and expertise. 


Because of that, this whitepaper will focus on practices required to operationalize gen Al 
applications using and adapting existing foundation models, referring to other whitepapers in 


the book should you want to deepdive into how foundational models are operationalized. 


This includes active areas of research such as model pre-training, alignment (ensuring the 
model's outputs align with the desired goals and values), evaluation or serving. 
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Example process | 


Operationalize 


Foundational 

models Serving / A Instruction ae 
— Alignment — . — Pre-trainin 

(generally not Release g tuning g 

specific toa 

use case) 


Operationalize 
GenAl systems 
(specific to a 

use case) | 


data Experiment Deploy 


Discover | — | Curate | = | Develop & | 


Release: || Monitor 


Govern | 


Figure 2. Lifecycle of a Foundational Model & gen Al system and relative operationalization practices 


Discover 


As mentioned before, building foundational models from scratch is resource-intensive. 
Training costs and data requirements are substantial, pushing most practitioners towards 
adapting existing foundation models through techniques like fine-tuning and prompt 
engineering. This shift highlights a crucial need: efficiently discovering the optimal foundation 
model for a given use case. 


These two characteristics of the gen Al landscape make model discovery an essential 
MLOps practice: 


1. An abundance of models: The past year has witnessed an explosion of open-source 
and proprietary foundation models. Navigating this complex landscape, each with varying 
architectures, sizes, training datasets, and licenses, requires a systematic approach to 
identify suitable candidates for further evaluation. 
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2. No one-size-fits-all solution: Each use case presents unique requirements, demanding 
nuanced analysis of available models across multiple dimensions. 


Here are some factors to consider when exploring models: 


1. Quality: Early assessments can involve running test prompts or analyzing public 
benchmarks and metrics to gauge output quality. 


2. Latency & throughput: These factors directly impact user experience. A chatbot 
demands lower latency than batch-processed summarization tasks. 


3. Development & maintenance time: Consider the time investment for both initial 
development and ongoing maintenance. Managed models often require less effort than 
self-deployed open-source alternatives. 


4. Usage cost: Factor in infrastructure and consumption costs associated with using the 
chosen model. 


5. Compliance: Assess the model's ability to adhere to relevant regulations and 


licensing terms. 


Because the activity of discovery has become so important for gen Al systems, many model 
discoverability platforms were created to support this need. An example of that is Vertex 
Model Garden,’ which is explored later in this whitepaper. 


Develop and experiment 


The process of development and experimentation remains iterative and orchestrated 
while building gen Al applications. Each experimental iteration involves a tripartite 
interplay between data refinement, foundation model(s) selection and adaptation, and 
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rigorous evaluation. Evaluation provides crucial feedback, guiding subsequent iterations 

in a continuous feedback loop. Subpar performance might call for gathering more data, 
augmenting data, or further curating the data. Similarly, the adaptation of the foundation 
model itself might need tweaking - optimizing prompts, applying fine-tuning techniques, or 
even swapping it out for a different one altogether. This iterative refinement cycle, driven by 
evaluation insights, is just as critical for optimizing gen Al applications as it’s always been for 
traditional machine learning. 


The foundational model paradigm 


Foundation models differ from predictive models most importantly because they are multi- 
purpose models. Instead of being trained for a single purpose, on data specific to that 
task, foundation models are trained on broad datasets, and therefore can be applied to 
many different use cases. This distinction brings with it several more important differences 
between foundation models and predictive models. 


Foundation models also exhibit what are known as ‘emergent properties’,? capabilities that 
emerge in response to specific input without additional training. Predictive models are 
only able to perform the single function they were trained for; a traditional French-English 
translation model, for instance, cannot also solve math problems. 


Foundation models are also highly sensitive to changes in their input. The output of the 
model and the task it performs are strongly affected, indeed determined, by the input to the 
model. A foundation model can be made to perform translation, generation, or classification 
tasks simply by changing the input. Even insignificant changes to the input can affect its 
ability to correctly perform that task. 
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These new properties of foundation models have created a corresponding paradigm shift 

in the practices required to develop and operationalize Gen Al systems. While models in 

the predictive Al context are self-sufficient and task-specific, gen Al models are multi- 
purpose and need an additional element beyond the user input to function as part of a 

gen Al Application: a prompt, and more specifically, a prompt template, defined as a set of 
instructions and examples along with placeholders to accommodate user input. A prompt 
template, along with dynamic data such as user input, can be combined to create a complete 
prompt, the text that is passed as input to the foundation model. 


User input Prompt 


Is the model T400 available? 
You are a helpful assistant 
for Cymbal bikes, you help 


Prompt Template customers buying new bikes. 


You are an helpful 


: . User: Can I buy a new bike? 
assistant for Cymbal bikes, 


Instructions Output: Sure, we offer a 


you help customers buying 
new bikes. 


Examples: 


selection of new bikes for 
you to choose from! 


; User: Is the model 1400 
User: Can I buy a new bike? _ |----- availabile? 
Examples Output: Sure, we offer a Output: 


User input 
ith 


wi 
Placeholders 


selection of new bikes for 
you to choose from! 


User: {{ Question }} 
OurpuE? 


Figure 3. How Prompt Template and User input can be combined to create a prompt 
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The core component of LLM Systems: A prompted 
model component 


The presence of the prompt element is a distinguishing feature of gen Al applications. 
Neither the model nor the prompt is sufficient for the generation of content; gen Al needs the 
combination of both. We refer to the combination as a ‘prompted model component’. This 

is the smallest independent component sufficient to create an LLM application. The prompt 
does not need to be very complicated. It can be a simple instruction, such as “translate 

the following sentence from English to French”, followed by the sentence to be translated. 
Without that preliminary instruction, though, a foundation model would not perform the 
desired translation task. So a prompt, even just a basic instruction, is necessary along with 
the input to get the foundation model to do the task required by the application. 


Predictive Al : Bradictive 


E.g. Tabular data 


Prompted Model 


Generative Al F Prompt 
component eee Template |? Response 


E.g. Textual input 


Figure 4. Predictive Al unit compared with the gen Al unit 
This introduces an important distinction when it comes to MLOps practices for gen Al. In 


the development of a gen Al System, experimentation and iteration need to be done in the 
context of a prompted model component, the combination of a model and a prompt. The Gen 
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Al experimentation cycle typically begins with testing variations of the prompt — changing the 
wording of the instructions, providing additional context, or including relevant examples, etc., 
and evaluating the impact of those changes. This practice is commonly referred to as prompt 
engineering. 


Prompt engineering involves two iterative steps: 


1. Prompting: Crafting and refining prompts to elicit desired behaviors from a foundational 
model for a specific use case. 


2. Evaluation: Assessing the model's outputs, ideally programmatically, to gauge its 
understanding and success in fulfilling the prompt's instructions. 


Log as experiment & iterate 


Foundational 


2 Evaluate t--- 
Prompt Model 


Figure 5. The activity of prompt engineering 


Results of an evaluation can be optionally registered as part of an experiment, to allow for 
result tracking. Since the prompt itself is a core element of the prompt engineering process, 
it becomes a first class citizen within the artifacts part of the experiment. 


However, we need to identify which type of artifacts they are. In the good old days of 
Predictive Al, we had clear lines - data was one thing, pipelines and code another. But with 
the “Prompt” paradigm in gen Al, those lines get blurry. Think about it: prompts can include 
anything from context, instructions, examples, guardrails to actual internal or external data 
pulled from somewhere else. So, are prompts data? Are they code? 
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To address these questions, a hybrid approach is needed, recognizing that a prompt has 
different components and requires different management strategies. Let’s break it down: 


Prompt as Data: Some parts of the prompt will act just like data. Elements like few-shot 
examples, knowledge bases, and user queries are essentially data points. For these 
components, we need data-centric MLOps practices such as data validation, drift detection, 
and lifecycle management. 


Prompt as Code: Other components such as context, prompt templates, guardrails are mode 
code-like. They define the structure and rules of the prompt itself. Here, we need code- 
centric practices such as approval processes, code versioning, and testing. 


As a result, when applying MLOps practices to gen Al, it becomes important to have in place 
processes that give developers easy storage, retrieval, tracking, and modification of prompts. 
This allows for fast iteration and principled experimentation. Often one version of a prompt 
will work well with a specific version of the model and less well with a different version. In 
tracking the results of an experiment, both the prompt and its components version, and the 
model version must be recorded and stored along with metrics and output data produced by 
the prompted model. 


The fact that development and experimentation in gen Al requires working with the prompt 
and the model together introduces changes in some of the common MLOps practices, 
compared to the predictive Al case in which experimentation is done by changing the model 
alone. Specifically, several of the MLOps practices need to be expanded to consider the 
prompted model component together as a unit. This includes practices like evaluation, 
experiment tracking, model adaptation and deployment, and artifact management, 
which will be discussed below in this whitepaper. 
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Chain & Augment 


Gen Al models, particularly large language models (LLMs), face inherent challenges in 
maintaining recency and avoiding hallucinations. Encoding new information into LLMs 
requires expensive and data-intensive pre-training, posing a significant hurdle. Additionally, 
LLMs might be unable to solve complex challenges, especially when step-by-step reasoning 
is required. Depending on the use case, leveraging only one prompted model to perform 

a particular generation might not be sufficient. To solve this issue, leveraging a divide and 
conquer approach, several prompted models can be connected together, along with calls 
to external APIs and logic expressed as code. A sequence of prompted model components 
connected together in this way is commonly known as a chain. 


Log as experiment & iterate 


Chain 


a 
components Evaluate 


Prompt Founsational Prompt Roundatichal 
lodel lodel 
Augment 
— | with external — 


Prompt Model data Prompt Model 


Logic as 
code 


Figure 6. Components of a chain and relative development process 
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Two common chain-based patterns that have emerged to mitigate recency and 
hallucinations are retrieval augmented generation (RAG)? and Agents. 


- RAG addresses these challenges by augmenting pre-trained models with 
“knowledge” retrieved from databases, bypassing the need for pre-training. This 
enables grounding and reduces hallucinations by incorporating up-to-date factual 
information directly into the generation process. 


- Agents, popularized by the ReAct prompting technique,’ leverage LLMs as mediators 
interacting with various tools, including RAG systems, internal or external APIs, 
custom extensions, or even with other agents. This enables complex queries and 
real-time actions by dynamically selecting and utilizing relevant information sources. 
The LLM, acting as an agent, interprets the user’s query, decides which tool to utilize, 
and how to formulate the response based on the retrieved information. 


RAG and Agents approaches can be combined to create multi-agent systems connected 
to large information networks, enabling sophisticated query handling and real-time 
decision-making. 


The orchestration of different models, logic and APIs is not a novelty of gen Al 
Applications. For example, recommendation engines have long combined collaborative 
filtering models, content-based models, and business rules to generate personalized 
product recommendations for users. Similarly, in fraud detection, machine learning 
models are integrated with rule-based systems and external data sources to identify 
suspicious activities. 
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What makes these chains of gen Al components different, is that, we can't a priori 
characterize or cover the distribution of component inputs, which makes the individual 
components much harder to evaluate and maintain in isolation. 


This results in a paradigm shift in how Al applications are being developed for gen Al. 


Unlike Predictive Al where it is often possible to iterate on the separate models and 
components in isolation to then chain in the Al application, in gen Al it’s often easier to 
develop a chain in integration, performing experimentation on the chain end-to-end, iterating 
over chaining strategies, prompts, the underlying foundational models and other APIs in 

a coordinated manner to achieve a specific goal. No feature engineering, data collection, 

or further model training cycles is often needed; just changes to the wording of the 

prompt template. 


The shift towards MLOps for gen Al, in contrast to predictive Al, brings forth a new set of 
demands. Let's break down these key differences: 


1. Evaluation: Because of their tight coupling, chains need end-to-end evaluation, not just 
on a per-component basis, to gauge their overall performance and the quality of their 
output. In terms of evaluation techniques and metrics, evaluating chains is not dissimilar 
to evaluating prompted models. Please refer to the below segment on evaluation for more 
details on these approaches. 


2. Versioning: A chain needs to be managed as a complete artifact in its entirety. The chain 
configuration should be tracked with its own revision history for analysis, reproducibility, 
and understanding the impact of changes on output. Logging should also include the 
inputs, outputs, and intermediate states of the chain, and any chain configurations used 
during each execution. 
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3. Continuous Monitoring: Establishing proactive monitoring systems is vital for detecting 
performance degradation, data drift, or unexpected behavior in the chain. This ensures 
early identification of potential issues to maintain the quality of the generated output. The 
activity of monitoring Chains is discussed in detail in the section ‘Logging and Monitoring’. 


4. Introspection: The ability to inspect the internal data flows of a chain (inputs and outputs 
from each component) as well as the inputs and outputs of the entire chain is paramount. 
By providing visibility into the data flowing through the chain and the resulting content, 
developers can pinpoint the sources of errors, biases, or undesirable behavior. 


Log as experiment & iterate 


Chain 


components —_—_ Evaluate _—if------ ' 


=> Curate Data _—_ Tune —_- 


v 


Foundational 
Model 


Prompt Foundational Prompt 


Other 
Prompt Model Prompt Model components 


Embedding models can also be tuned independently 


Figure 7. Putting together chains, prompted models and model tuning 


There are several products in Vertex Al that can support the need for chaining and 
augmentation, including Grounding as a service,° Extensions,° and Vector Search,’ Agent 
Builder.® We discuss the products in the section “Role of a Al Platform”. Langchain’ is also 
integrated with the Vertex SDK,'° and can be used alongside the core Vertex products to 
define and configure gen Al chained applications. 
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Tuning & training 


When developing a gen Al use case and a specific task that involves LLMs, it can be difficult, 
especially for complex tasks, to rely on only prompt engineering and chaining to solve it. 

To improve task performance practitioners often also need to fine-tune the model directly. 
Fine-tuning lets you actively change the layers or a subset of layers of the LLM to optimize 
the capability of the model to perform a certain task. Two of the most common ways of 
tuning a model are: 


1. Supervised fine-tuning: This is where we train the model in a supervised manner, teaching 
it to predict the right output sequence for a given input. 


2. Reinforcement Learning from Human Feedback (RLHF): In this approach, we first train 
a reward model to predict what humans would prefer as a response. Then, we use this 
reward model to nudge the LLM in the right direction during the tuning process. Like 
having a panel of human judges guiding the model's learning. 


Log as experiment & iterate 


Figure 8. Putting together chains, prompted models and model tuning 
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When viewed through the MLOps lens, fine-tuning shares similar requirements with 
model training: 


1. The capability to track artifacts being part of the tuning job. This includes for example the 
input data or the parameters being used to tune the model. 


2. The capability to measure the impact of the tuning. This translates into the capability 
to perform evaluation of the tuned model for the specific tasks it was trained on and to 
compare results with previously tuned models or frozen models for the same task. 


Platforms like Vertex Al" (and the Google Cloud platform more broadly) provide a robust 

suite of services designed to address these MLOps requirements: Vertex Model Registry," 
for instance, provides a centralized storage location for all the artifacts created during the 
tuning job, and Vertex Pipelines streamlines the development and management of these 
tuning jobs. Dataplex,'* meanwhile, provides an organization-wide data fabric for data lineage 
and governance and integrates well with both Vertex Al and BigQuery."° What’s more, these 
products provide the same governance capability for both predictive and gen Al applications, 
meaning customers do not need separate products or configurations to manage generative 
versus Al development. 


Continuous Training & Tuning 


In machine learning operations (MLOps), continuous training is the practice of repeatedly 
retraining machine learning models in a production environment. This is done to ensure 
that the model remains up-to-date and performs well as real-world data patterns change 
over time. For gen Al models, continuous tuning of the models is often more practical than 
retraining from scratch due to the high data and computational costs involved. 
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The approach to continuous tuning depends on the specific use case and goals. For relatively 
static tasks like text summarization, the continuous tuning requirements may be lower. But 

for dynamic applications like chatbots that need constant human alignment, more frequent 
tuning using techniques like RLHF based on human feedback is necessary. 


To determine the right continuous tuning strategy, Al practitioners must carefully evaluate 
the nature of their use case and how the input data evolves over time. Cost is also a major 
consideration, as the compute infrastructure greatly impacts the speed and expense of 
tuning. We discuss in detail monitoring of GenAl systems in the Logging and Monitoring 
section of this whitepaper. 


Graphics processing units (GPUs) and tensor processing units (TPUs) are key hardware for 
fine-tuning. GPUs, known for their parallel processing power, are highly effective in handling 
the computationally intensive workloads and often associated with training and running 
complex machine learning models. TPUs, on the other hand, are specifically designed 

by Google for accelerating machine learning tasks. TPUs excel in handling large matrix 
operations common in deep learning neural networks. 


To manage costs, techniques like model quantization can be applied. This represents model 
weights and activations using lower-precision 8-bit integers rather than 32-bit floats, which 


reduces computational and memory requirements. 


We discuss in detail the support for tuning in Vertex Al in the Customize: Vertex Al Training & 
Tuning section. 
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Data Practices 


Traditionally, ML model behavior was dictated solely by its training data. While this still holds 
true for foundation models — trained on massive, multilingual, multimodal datasets — gen Al 
applications built on top of them introduce a new twist: model behavior is now determined by 
how you adapt the model using different types of input data (Figure. 9). 


A model behavior is determined by the data... 


..It was trained on ...[t was adapted on 


Pretraining Datasets Prompts 
(e.g. C4, The Pile, Proprietary data) 


Augmentation/Grounding Data 
(Websites, Docs, PDFs, DBs, APIs, etc.) 


Task-Specific Data for PEFT 


Task-Specific Evaluations 


Instruction Tuning Datasets 
Safety Tuning Datasets 


Human Preference Data, etc. Human Preferences Data, etc. 


FM Creation Application with FM 


Figure 9. Examples of data spectrum for foundation models - creation (left) vs. adaptation (right) 


The key difference between traditional predictive ML and gen Al lies in where you start. In 
predictive ML, the data is paramount. You spend a lot of time on data engineering, and if you 
don’t have the right data, you cannot build an application. Gen Al takes a unique approach to 
this matter. You start with a foundation model, some instructions and maybe a few example 
inputs (in-context learning). You can prototype and launch an application with surprisingly 
little data. 
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This ease of prototyping, however, comes with a challenge. Traditional predictive Al relies on 
apriori well-defined dataset(s). In gen Al, a single application can leverage various data types, 
from completely different data sources, all working together (Figure 10). Let’s explore some 
of these data types: 


* Conditioning prompts: These are essentially instructions given to the Foundation Model 
(FM) to guide its output, setting boundaries of what it can generate. 


+ Few-shot examples: A way to show the model what you want to achieve through input- 
output pairs. This helps the model grasp the specific task(s) at hand, and in many cases, it 
boosts performances. 


* Grounding/augmentation data: Data coming from either external APIs (like Google 
Search) or internal APls and data sources. This data permits the FM to produce answers 
for a specific context, keeping responses current, relevant without retraining the entire 
FM. This type of data also supports reducing hallucinations. 


* Task-specific datasets: These are used to fine-tune an existing FM for a particular task, 
improving its performance in that specific area. 


- Human preference datasets: These capture feedback on generated outputs, helping 
refine the model's ability to produce outputs that align with human preferences. 


- Full pre training corpora: These are massive datasets used to initially train foundation 
models. While application builders may not have access to them nor the tokenizers, 
the information encoded in the model itself will influence the application’s output 
and performance. 


This is not an exhaustive list. The variety of data used in gen Al applications is constantly 
growing and evolving. 
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Deploying, monitoring, logging, etc. 


Foundational Model(s) Adaptation 


PEFT tuning 
Prompt . 
Engineering Evaluations 
RLHF, DPO, etc. 


| 


Contextual data, Task-specific data, Task-specific 
guardrails, Human Preferences evaluation data, 

Augmentation/ data, etc. Adversarial testing 

Grounding Data, data, etc. 


Foundation 
Model(s) 


Embeddings, 
Vector Stores 
SQL DBs 
NoSQL DBs, 


Input-output 
APIs, etc. 


examples, etc. 


Data 


Figure 10. Example of high-level data and adaptations landscape for developing gen Al applications using 
existing foundation models 


This diverse range of data adds another complexity layer in terms of data organization, 
tracking and lifecycle management. Take a RAG-based application as an example: it might 
involve rewriting user queries, dynamically gathering relevant examples using a curated set 
of examples, querying a vector database, and combining it all with a prompt template. This 
involves managing multiple data types: user queries, vector databases with curated few-shot 
examples and company information, and prompt templates. 
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Each data type needs careful organization and maintenance. For example, the vector 
database requires processing data into embeddings, optimizing chunking strategies, and 
ensuring only relevant information is available. The prompt template itself needs versioning 
and tracking, the user queries need rewriting, etc. This is where traditional MLOps and 
DevOps best practices come into play, with a twist. We need to ensure reproducibility, 
adaptability, governance, and continuous improvement using all the data required in an 
application as a whole but also individually. Think of it this way: in predictive Al, the focus 
was on well-defined data pipelines for extraction, transformation, and loading. In gen Al, 
it's about building pipelines to manage, evolve, adapt and integrate different data types ina 
versionable, trackable, and reproducible way. 


As mentioned earlier, fine-tuning foundation models (FMs) can boost gen Al app 
performance, but it needs data. You can get this data by launching your app and gathering 
real-world data, generating synthetic data, or a mix of both. Using large models to generate 
synthetic data is becoming popular because it speeds things up, but it's still good to have a 
human check the results for quality assurance. Here are few ways to leverage large models 
for data engineering purposes: 


1. Synthetic data generation: This process involves creating artificial data that closely 
resembles real-world data in terms of its characteristics and statistical properties, often 
being done with a large and capable model. This synthetic data serves as additional 
training data for gen Al, enabling it to learn patterns and relationships even when labeled 
real-world data is scarce. 


2. Synthetic data correction: This technique focuses on identifying and correcting errors 
and inconsistencies within existing labeled datasets. By leveraging the power of larger 
models, gen Al can flag potential labeling mistakes and propose corrections, improving the 
quality and reliability of the training data. 
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3. Synthetic data augmentation: This approach goes beyond simply generating new 
data. It involves intelligently manipulating existing data to create diverse variations while 
preserving essential features and relationships. Thus, gen Al can encounter a broader 
range of scenarios during training, leading to improved generalization and ability to 
generate nuanced and relevant outputs. 


Evaluating gen Al, unlike predictive Al, is tricky. You don't usually know the training data 
distribution of the foundational models. Building a custom evaluation dataset reflecting your 
use Case is essential. This dataset should cover essential, average, and edge cases. Similar 
to fine-tuning data, you can leverage powerful language models to generate, curate, and 
augment data for building robust evaluation datasets. 


Evaluate 


Even if only prompt engineering is performed, as any experimental process, it does require 
evaluation in order to iterate and improve. This makes the evaluation process a core activity 
of the development of any gen Al systems. 


In the context of gen Al systems, evaluation might have different degrees of automation: from 
entirely driven by humans to entirely automated by a process. 


In the early days of a project, when you're still prototyping, evaluation is often a manual 
process. Developers eyeball the model's outputs, getting a qualitative sense of how it's 
performing. But as the project matures and the number of test cases balloons, manual 
evaluation becomes a bottleneck. That's when automation becomes key. 
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Automating evaluation has two big benefits. First, it lets you move faster. Instead of spending 
time manually checking each test case, you can let the machines do the heavy lifting. 

This means more iterations, more experiments, and ultimately, a better product. Second, 
automation makes evaluation more reliable. It takes human subjectivity out of the equation, 
ensuring that results are reproducible. 


But automating evaluation for gen Al comes with its own set of challenges. 


For one, both the inputs (prompts) and outputs can be incredibly complex. A single prompt 

might include multiple instructions and constraints that the model needs to juggle. And the 

outputs themselves are often high-dimensional - think a generated image or a block of text. 
Capturing the quality of these outputs in a simple metric is tough. 


There are some established metrics, like BLEU for translations and ROUGE for summaries, 
but they don't always tell the full story. That's where custom evaluation methods come in. 
One approach is to use another foundational model as a judge. For example, you could 
prompt a large language model to score the quality of generated texts across various 
dimensions. This is the idea behind techniques like AutoSxS." 


Another challenge is the subjective nature of many evaluation metrics for gen Al. What 
makes one output ‘better’ than another can often be a matter of opinion. The key here is to 
make sure your automated evaluation aligns with human judgment. You want your metrics 
to be a reliable proxy for what people would think. And to ensure comparability between 
experiments, it's crucial to lock down your evaluation approach and metrics early in the 
development process. 


Lack of ground truth data is another common hurdle, especially in the early stages of a 


project. One workaround is to generate synthetic data to serve as a temporary ground truth, 
which can be refined over time with human feedback. 
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Finally, comprehensive evaluation is essential for safeguarding gen Al applications against 
adversarial attacks. Malicious actors can craft prompts to try to extract sensitive information 
or manipulate the model's outputs. Evaluation sets need to specifically address these attack 
vectors, through techniques like prompt fuzzing (feeding the model random variations on 
prompts) and testing for information leakage. 


Automating the evaluation process ensures speed, scalability and reproducibility 


An automation of the evaluation process can be considered a proxy for the 
human judgmen 


Depending on the use case, the evaluation process will require a high degree 
of customization. 


To ensure comparability it is essential to stabilize the evaluation approach, metrics, 
and ground truth data as early as possible in the development phase. 


It is possible to generate synthetic ground truth data to accommodate for the lack of 
real ground truth data. 


It is important to include test cases of adversarial prompting as part of the evaluation 
set to test the reliability of the system itself for these attacks. 


Table 1. Key suggestions to approach evaluation of gen Al systems 
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Deploy 


It should be clear by this point that production gen Al applications are complex systems with 
many interacting components. Some of the common components discussed include multiple 
prompts, models, adapter layers and external data sources. In deploying a gen Al system to 
production, all these components need to be managed and coordinated with the previous 
stages of gen Al system development. Given the novelty of these systems, best practices 
for deployment and management are still evolving, but we can discuss observations and 
recommendations for these components and indicate how to address the major concerns. 


Deploying gen Al solutions necessarily involves multiple steps. For example, a single 
application might utilize several large language models (LLMs) alongside a database, all 
fed by a dynamic data pipeline. Each of these components potentially requires its own 
deployment process. 


For clarity, we distinguish between two main types of deployment: 


1. Deployment of gen Al systems: This focuses on operationalizing a complete system 
tailored for a specific use case. It encompasses deploying all the necessary elements 
- the application, chosen LLMs, database, data pipelines, and any other relevant 
components - to create a functioning end-user solution. 


2. Deployment of foundational models: This applies to open-weight models, where the 
model weights are publicly available on platforms like Vertex Model Garden or Hugging 
Face, or privately trained models. Deployment in this scenario centers around making 
the foundational model itself accessible to users. Given their multipurpose nature, these 
deployments often aim to support various potential use cases. 
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Deployment of gen Al systems 


Deployment of gen Al systems is broadly similar to deployment of any other complex 
software system. Most of the system components - databases, Python applications, etc. - 
are also found in other non-gen Al applications. As a result, our general recommendation is 
to manage these components using standard software engineering practices such as version 
control” and Continuous Integration / Continuous Delivery (CI/CD)." 


Version control 


Gen Al experimentation is an iterative process involving repeated cycles of development, 
evaluation, and modification. To ensure a structured and manageable approach, it's crucial to 
implement strict versioning for all modifiable components. These components include: 


- Prompt templates: Unless leveraging specific prompt management solutions, version 
them through standard version control tools like Git. 


+ Chain definitions: The code defining the chain (including API integrations, database calls, 
functions, etc.) should also be versioned using tools like Git. This provides a clear history 
and enables easy rollback if needed. 


+ External datasets: In retrieval augmented generation (RAG) systems, external datasets 
play a key role. It’s important to track these changes and versions of these datasets for 
reproducibility. You can do that by leveraging existing data analytics solutions such as 
BigQuery, AlloyDB, Vertex Feature Store. 


- Adapter models: The landscape of techniques like LoRA tuning for adapter models is 
constantly evolving. . You can leverage established data storage solutions (e.g. cloud 
storage) to manage and version these assets effectively. 
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Continuous integration of gen Al systems 


In a continuous integration framework, every code change goes through automatic testing 
before merging to catch issues early. Here, unit and integration testing are key for quality 
and reliability. Unit tests act like a microscope, zooming in on individual code pieces, while 
integration testing verifies that different components work together. 


The benefits of continuous integration in traditional software development are well- 
understood. Implementing a Cl system helps to do the following: 


1. Ensure reliable, high-quality outputs: Rigorous testing increases confidence in the 
system's performance and consistency. 


2. Catch bugs early: Identifying issues through testing prevents them from causing bigger 
problems downstream. It also makes the system more robust and resilient to edge cases 
and unexpected inputs. 


3. Lower maintenance costs: Well-documented test cases simplify troubleshooting and 
enable smoother modifications in the future, reducing overall maintenance efforts 


These benefits are applicable to gen Al Systems as much as any software product. 


Continuous Integration should be applied to all elements of the system, including the prompt 


templates, the chain and chaining logic, and any embedding models and retrieval systems. 


However, applying Cl to gen Al comes with challenges: 


1. Difficult to generate comprehensive test cases: The complex and open-ended nature of 
gen Al outputs makes it hard to define and create an exhaustive set of test cases that 
cover all possibilities. 
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2. Reproducibility issues: Achieving deterministic, reproducible results is tricky since 
generative models often have intrinsic randomness and variability in their outputs, even for 
identical inputs. This makes it harder to consistently test for specific expected behaviors. 


These challenges are closely related to the broader question of how to evaluate gen Al 
systems. Many of the same techniques discussed in the Evaluation section above can also 
be applied to the development of Cl systems for gen Al. This is an ongoing area of research, 
however, and more techniques will undoubtedly emerge in the near future. 


Continuous delivery of gen Al systems 


Once the code is merged, a continuous delivery process begins to move the built and tested 
code through environments that closely resemble production for further testing before the 
final deployment. 


As mentioned in the "Develop and Experiment"'" segment, chain elements become one 
of the main components to deploy, as they fundamentally constitute the gen Al application 


serving users. 


The delivery process of the gen Al application containing the chain may vary depending on 
the latency requirements and whether the use case is batch or online: 


1. Batch use cases require deploying a batch process executed on a schedule in production. 
The delivery process should focus on testing the entire pipeline in integration in an 
environment close to production before deployment. As part of the testing process, 
developers can assert specific requirements around the throughput of the batch process 
itself and checking that all components of the application are functioning correctly (e.g., 
permissioning, infrastructure, code dependencies). 
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2. Online use cases require deploying an API, in this case, the application containing the 
chain, capable of responding to users at low latency. The delivery process should involve 
testing the API in integration in an environment close to production, with tests to assert 
that all components of the application are functioning correctly (e.g., permissioning, 
infrastructure, code dependencies). Non-functional requirements (e.g., scalability, 
reliability, performance) can be verified through a series of tests, including load tests. 


Deployment of foundation models 


Because foundation models are so large and complex, deployment and serving of these 
models raises a number of issues — most obviously, the compute and storage resources 
needed to run these massive models successfully. At a minimum, a foundation model 
deployment needs to include several key considerations: selecting and securing necessary 
compute resources, such as GPUs or TPUS; choosing appropriate data storage services 
like BigQuery or Google Cloud Storage that can scale to deal with the large datasets; and 
implementing model optimization or compression techniques. 


Infrastructure validation 


One technique that can be applied to address the resource requirements of gen Al systems is 
infrastructure validation. This refers to the introduction of an additional verification step, prior 
to deploying the training and serving systems, to check both the compatibility of the model 
with the defined serving configuration and the availability of the required hardware. There 
are a number of optional infrastructure validation layers that can perform some of these 
checks automatically. For instance, TFX"’ has an infrastructure validation layer that checks 
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whether the model will run correctly on a specified hardware configuration, which can help 
catch configuration issues before deployment. Nevertheless, the availability of the required 
hardware still needs to be verified by hand by the engineer or the system administrator. 


Compression and optimization 


Another way of addressing infrastructure challenges is to optimize the model itself. 
Compressing and/or optimizing the model can often significantly reduce the storage and 
compute resources needed for training and serving, and in many cases can also decrease 
the serving latency. 


Some techniques for model compression and optimization include quantization, distillation 
and model pruning. Quantization reduces the size and computational requirements of the 
model by converting its weights and activations from higher-precision floating-point numbers 
to lower-precision representations, such as 8-bit integers or 16-bit floating-point numbers. 
This can significantly reduce the memory footprint and computational overhead of the model. 
Model Pruning is a technique for eliminating unnecessary weight parameters or by selecting 
only important subnetworks within the model. This reduces model size while maintaining 
accuracy as high as possible. Finally, distillation trains a smaller model, using the responses 
generated by a larger LLM, to reproduce the output of the larger LLM for a specific domain. 
This can significantly reduce the amount of training data, compute, and storage resources 
needed for the application. 


In certain situations, model distillation can also improve the performance of the model itself 
in addition to reducing resource requirements. This happens because the smaller model can 
combine the knowledge of the larger model with labeled data, which can help it to generalize 
better to new data on a limited use case.The process of distillation usually involves training 

a large foundational LLM (teacher model) and having it generate responses to certain tasks, 


September 2024 35 


Operationalizing Generative Al on Vertex Al using ML Ops 


and then having the smaller LLM (student model) use a combination of the LLMs knowledge 
as well as task specific supervised dataset to learn. The size and complexity of the smaller 
LLM can be adjusted to achieve the desired trade-off between performance and resource 
requirements. A technique known as step-by-step distillation?° has proven to achieve 

great results. 


Deployment, packaging, and serving checklist 


Following are the important steps to take when deploying a model on Vertex Al. 


O Configure version control: Implement version control practices for LLM deployments. 
This allows you to roll back to previous versions if necessary and track changes made to 
the model or deployment configuration. 


O Optimize the model: Perform any model optimization (distillation, quantization, pruning, 
etc.) before packaging or deploying the model. 


O Containerize the model: Package the trained LLM model into a container. 


O Define target hardware requirements: Ensure the target deployment environment 
meets the requirements for optimal performance of the LLM model, such as GPUs, as well 
as TPUs and other specialized hardware accelerators. 


O Define model endpoint: Define the endpoint configuration using Vertex Al's endpoint 
creation interface or the Vertex Al SDK. Specify the model container, input and output 
formats, and any additional configuration parameters. 


1 Allocate resources: Allocate the appropriate compute resources for the endpoint based 
on the expected traffic and performance requirements. 
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O Configure access control: Set up access control mechanisms to restrict access to 
the endpoint based on authentication and authorization policies. This ensures that only 
authorized users or services can interact with the deployed LLM. 


O Create model endpoint: Create a Vertex Al endpoint to deploy”' the LLM as a REST API 
service. This allows clients to send requests to the endpoint and receive responses from 
the LLM.. 


O Configure monitoring and logging: Establish monitoring and logging systems to track 
the endpoint's performance, resource utilization, and error logs. 


O Deploy custom integrations: Integrate the LLM into custom applications or services 
using the model's SDK or APIs. This provides more flexibility for integrating the LLM into 
specific workflows or frameworks. 


O Deploy Real-time Applications: For real-time applications, consider using Cloud 
Functions and Cloud Run in combination with LLMs hosted in Vertex Al to create a 
streaming pipeline that processes data and generates responses in real time. 


Logging and monitoring 


Monitoring gen Al applications and, as a result, their components, presents unique 
challenges that require additional techniques and approaches on top of those in traditional 
MLOps. The use of gen Al requires the chaining of components in order to produce results 
for practical applications. Additionally, to your application user, all the components are 
hidden. Therefore, the interface they have to your application is their input and the final 
output. This creates the need to log and monitor your application end-to-end: that is, logging 
and monitoring the input and output of your application overall as well as the input and 
output of every single component. 
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Logging is necessary for applying monitoring and debugging on your gen Al system in 
production. An input to the application triggers multiple components. Imagine the output 

to a given input is factually inaccurate. How can you find out which of the components are 
the ones that didn’t perform well? To answer this question it is necessary to apply logging 
on the application level and at the component level. We need lineage in our logging for all 
components executed. For every component we need to log their inputs and outputs. We 
also need to be able to map those with any additional artifacts and parameters they depend 
on so we Can easily analyze those inputs and outputs. 


Monitoring can be applied to the overall gen Al application and to individual components. We 
prioritize monitoring at the application level. This is because if the application is performant 
and monitoring proves that, it implies that all components are also performant. You can also 
apply the same practices to each of the prompted model components to get more granular 
results and understanding of your application. 


Skew detection in traditional ML systems refers to training-serving skew that occurs when 
the feature data distribution in production deviates from the feature data distribution 
observed during model training. In the case of Gen Al systems using pretrained models in 
components chained together to produce the output, we need to modify our approach. We 
can measure skew by comparing the distribution of the input data we used to evaluate our 
application (the test set as described under the Data Curation and Principles section above) 
and the distribution of the inputs to our application in production. Once the two distributions 
drift apart,further investigation is needed. The same process can be applied to the output 
data as well. 
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Figure 11. Drift/skew detection process overview 


Like skew detection, the drift detection process checks for statistical differences between 
two datasets. However, instead of comparing evaluations and serving inputs, drift looks for 
changes in input data. This allows you to check how the inputs and therefore the behavior of 
your users changed over time. This is the same as traditional MLOps. 


Given that the input to the application is typically text, there are a few approaches to 
measuring skew and drift. In general all the methods are trying to identify significant 
changes in production data, both textual (size of input) and conceptual (topics in input), 
when compared to the evaluation dataset. All these methods are looking for changes that 
could potentially indicate the application might not be prepared to successfully handle the 
nature of the new data that are now coming in. Some common approaches are calculating 
embeddings and distances, counting text length and number of tokens, and tracking 
vocabulary changes, new concepts and intents, prompts and topics in datasets, as well 

as Statistical approaches such as least-squares density difference,?? maximum mean 
discrepancy (MMD),?3 learned kernel MMD,” or context-aware MMD.* As gen Al use cases 
are so diverse, it is often necessary to create additional custom metrics that better capture 
abnormal changes in your data. 
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Continuous evaluation is another common approach to GenAl application monitoring. In 

a continuous evaluation system, you capture the model's production output and run an 
evaluation task using that output, to keep track of the model's performance over time. One 
approach is collecting direct user feedback, such as ratings (for example thumbs up/down), 
which provides immediate insight into the perceived quality of outputs. In parallel, comparing 
model-generated responses against established ground truth, often collected through 
human assessment or as a result of an ensemble Al Model approach, allows for deeper 
analysis of performance. Ground truth metrics can be used to generate evaluation metrics 

as described in the Evaluation section. This process provides a view on how your evaluation 
metrics changed from when you developed your model to what you have in production today. 


As with traditional monitoring in MLOps an alerting process should be deployed for notifying 
application owners when a drift, skew or performance decay from evaluation tasks is 
detected. This can help you promptly intervene and resolve issues. This is achieved by 
integrating alerting and notification tools into your monitoring process. 


Monitoring expands beyond drift, skew and evaluation tasks. Monitoring in MLOps includes 
efficiency metrics like resources utilization and latency. Efficiency metrics are as relevant and 
important in gen Al as they are in any other Al application. 


Vertex Al provides a set of tools that can help with monitoring. Model Evaluation for gen Al?é 
tasks can be used for classification, summarization, question answering, and text generation 
tasks. Vertex Pipelines can be used to allow the recurrent execution of evaluation jobs in 
production as well as running pipelines for skew and drift detection processes. 


September 2024 40 


Operationalizing Generative Al on Vertex Al using ML Ops 


Govern 


In the context of MLOps governance encompasses all the practices, and policies that 
establish control, accountability, and transparency over the development, deployment, and 
ongoing management of machine learning (ML) models, including all the activities related to 
the code, data and models lifecycle. 


As mentioned in the Develop & Experiment section the chain element and the relative 
components become a new type of assets that need to be governed over the full lifecycle 
from development to deployment, to monitoring. 


The governance of the chain element lifecycle extends to lineage tracking practices as well. 


While for predictive Al systems lineage focuses on tracking and understanding the complete 
journey of a machine learning model, in gen Al, lineage goes beyond the model artifact 
extending to all the components in the chain. This includes the data and models used and 
their lineage, the code involved and the relative evaluation data and metrics. This can help 
auditing, debugging and improvements of the models 


Along with these new practices, existing MLOps and DevOps practices still apply to MLOps 
for gen Al: 

1. The need to govern the data lifecycle; see “Data Practices”. 

2. The need to govern the tuned model lifecycle; see “Tuning and Training”. 


3. The need to govern the code lifecycle; see “Deployment of GenAl 
System components”. 
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The next segment will introduce a set of products that allow developers to perform 
governance of the data, model and code assets. We will discuss products like Google 
Cloud Dataplex, which centralizes the governance of model and data, Vertex ML Metadata 
and Vertex Experiment, which allows developers to register experiments, their metrics 
and artifacts. 


The role of an Al platform for gen 
Al operations 


Alongside the explosion of both predictive and gen Al applications, Al platforms, like Vertex 
Al," have emerged as indispensable tools for organizations seeking to leverage the power of 
Artificial Intelligence (Al). These comprehensive platforms provide a unified environment that 
streamlines the entire Al lifecycle, from data preparation and model training to deployment, 
automation, continuous integration/continuous delivery (CI/CD), governance, and monitoring. 


At the heart of an Al platform lies its ability to support diverse Al development needs. 
Whether you seek to utilize pre-trained Al solutions, adapt existing models through tuning 
or transfer learning, or embark on training your own large models, Al platforms provide the 
infrastructure and tools necessary to support these journeys. The advent of these platforms 
has revolutionized the way organizations approach Al, enabling them to productionize Al 
applications in a secure, enterprise-ready, responsible, controlled and scalable manner. 
These platforms accelerate innovation as well as foster reproducibility and collaboration 
while reducing costs and maximizing Return on Investment (ROI). 
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The new gen Al paradigm discussed in prior sections demands a robust and reliable Al 
platform that can seamlessly integrate and orchestrate a wide range of functionalities. 
These functionalities include model tuning for specific tasks; leveraging paradigms like 
retrieval augmented generation? (RAG) to connect to internal and external data sources; 
and pre-training or instruction fine-tuning large models from scratch. Complex applications 
also often require chaining with other models, such as classifiers to route inputs to the 
appropriate LLM/ML model, extraction of customer information from a knowledge base, 
inclusion of safety checks, or even creation of caching systems for cost optimization. 


Vertex Al Agent Builder 


OOTB and custom Agents | Search 
Orchestration | Extensions | Connectors | Document Processors | Retrieval engines | Rankers | Grounding 


Vertex Al Vertex Al Model Builder 


Prompt | Serve | Tune | Distill | Eval | Notebooks | Training | Feature Store | Pipelines | Monitoring 


Vertex Al Model Garden 


Google | Open | Partner 


Figure 12. Key components of Vertex Al for gen Al 


Key components of Vertex Al for gen Al 


Vertex Al eliminates the complexities of managing the entire infrastructure required for Al 
development and deployment. Instead, Vertex Al offers a user-centric approach, providing 
on-demand access to the needed resources. This flexibility empowers organizations to 
focus on innovation and collaboration, rather than infrastructure management, and up- 
front hardware purchase. The features of Vertex Al that support gen Al development can be 
grouped into eight areas. 
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Discover: Vertex Model Garden 


As discussed before, there is already a wide variety of available foundation models, trained 
on a broad range of datasets, and the cost of training a new foundation model can be 
prohibitive. Thus it often makes sense for companies to adapt existing foundation models 
rather than creating their own from scratch. As a result, a platform facilitating seamless 
discovery and integration of diverse model types is critical. 


Vertex Al Model Garden’ supports these needs, offering a curated collection of over 

150 Machine Learning and gen Al models from Google, Google partners, and the open- 
source community. It simplifies the discovery, customization, and deployment of both 
Google’s proprietary foundational models and diverse open-source models across a 

vast spectrum of modalities, tasks, and features. This comprehensive repository permits 
developers to leverage the collective research on artificial intelligence models within a single 
streamlined environment. 


Model Garden encompasses a diverse range of modalities such as Language, Vision, Tabular, 
Document, Speech, Video, and Multimodal data. This broad coverage enables developers 

to tackle a multitude of tasks, including generation, classification, regression, extraction, 
recognition, segmentation, tracking, translation, and embedding. Model Garden houses 
Google’s proprietary and foundational models (like Gemini,?” PaLM 2,7° Imagen?’) alongside 
numerous popular open source and third-party partner models like like Llama 3,°° T5 Flan,*" 
BERT,*? Stable Diffusion,** Claude 3 (Anthropic),*4 and Mistral Al.*° Additionally, it offers task- 
specific models for occupancy analysis, watermark detection, text-moderation, text-to-video, 
hand-gesture recognition, product identification, and tag recognition, among others. Every 
model*¢ in Vertex Model Garden has a model card which includes a description of the model, 
the main use cases that can cover, and the option (if available) to tune the model or deploy 

it directly. 
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Model Garden fosters experimentation by facilitating access to Google’s proprietary 
foundational models through the Vertex Al Studio UI,?’ a playground where you can play 
around with prompts, models, and open-source models using provided Colab notebooks. 
One-click deployment is available for some external models, and there are more than 40 
models available for fine-tuning for specific needs. Furthermore, the platform allows users to 
leverage technologies like vLLM* and quantization techniques for optimizing deployments for 
efficiency and reduced costs. We present below an overview of some of the models in Model 
Garden. For an up-to-date list, please visit.%° 
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Model Type 


First-party models 


First-party models 


Open models 


Third-party models 


Description 
Foundation models 


Leverage multimodal models 
from Google across vision, 
dialog, code generation, and 
code completion. 


Pre-trained APIs 


Build and deploy Al 
applications faster with our 
pre-trained APIs powered by 
the best Google Al research 
and technology. 


Open source models 


Access a wide variety of 
enterprise-ready open 


source models 


Third-party models 


Model Garden will support 
third-party models 
from partners with 
foundation models. 


Details 
Gemini®? and Palm2*° 
Imagen for text-to-image” 


Codey for code generation 
and completion’? 


Chirp for speech-to-text*® 
Text-to-Speech*4 

Natural Language processing*® 
Translation*¢ 

Vision’ 

Google’s Gemma,“ PaliGemma,"® 
CodeGemma’”? 

Meta's Llama*° 

TIl's Falcon®° 

Mistral Al® 


BERT,*2 T-5 FLAN, *" 
ViT, © EfficientNet®? 


Anthropic’s Claude 3 Haiku, 
Sonnet and Opus™*55 


Table 2. An overview of some of the models in Model Garden [Last Updated: March 18th, 2024] 
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Prototype: Vertex Al Studio & Notebooks 


Rapid development and prototyping capabilities are also essential for developing gen Al 
applications. Vertex Al prioritizes inclusivity and flexibility in its development environments, 
catering to a wide range of developer preferences and proficiency levels. This platform 
provides options for both console-driven and programmatic development workflows. Users 
can leverage the intuitive web interface for end-to-end application creation or utilize various 
APIs for deeper customization and control. These include the REST API°* and dedicated 
SDKs for Python,®” NodeJS® and Java,*’ ensuring compatibility with diverse programming 
languages and ecosystems. Developers can choose to use the tools and IDEs of their 
choice for interacting with the platform, or take advantage of Vertex-native tools like Vertex 
Colab Enterprise or Vertex Workbench to explore and experiment with code within familiar 
notebook environments. 


Vertex Al Studio® provides a unified console-driven entry point to access and leverage the 
full spectrum of Vertex Al's gen Al services. It facilitates exploration and experimentation with 
various Google first party foundation models (for example, PaLM 2, Gemini, Codey, Imagen, 
and Universal Speech Model). Additionally, it offers prompt examples and functionalities 

for testing distinct prompts and models with diverse parameters. It’s also possible to adapt 
existing models through various techniques like supervised fine-tuning (SFT), reinforcement 
learning tuning techniques, and Distillation, and deploy gen Al applications in just a few 
clicks. Vertex Al Studio considerably simplifies and democratizes gen Al adoption, catering 

to a variety of users, from business analysts to machine learning engineers. You can see the 
homepage of Vertex Al Studio in Figure 13. 
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Figure 13. Vertex Al Studio - Homepage 


Customize: Vertex Al training & tuning 


While prompt engineering and augmentation are sufficient for some gen Al use cases, other 
cases require training, tuning and adapting the models to get the best results. Vertex Al 
provides a comprehensive platform for training and adapting LLMs, supporting a range of 
techniques and approaches from prompt engineering to training models from scratch. 
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Train 


For full-scale LLM training, TPUs and GPUs are vital because of their superior processing 
power and memory capacity compared to CPUs. GPUs excel at parallel processing, enabling 
faster model training. TPUs, specifically designed for machine learning tasks, offer even 
faster processing and higher energy efficiency. This makes them ideal for large-scale, 
complex models. Google Cloud provides a range of offerings to support LLM training, 
including TPU VMs with various configurations, pre-configured Al platforms like Vertex Al, 
and dedicated resources like Cloud TPU Pods for scaling up training. These offerings allow 
users to choose the right infrastructure for their needs, accelerating LLM development and 
enabling cutting-edge research and applications. 


Tune 


Vertex Al also provides a comprehensive solution for adapting pre-trained LLMs. It supports 
a spectrum of techniques from a non-technical prompt engineering playground at inference 
time, to data-driven approaches involving tuning, reinforcement learning and distillation 
methods during the development or adaptation phase. The following five techniques — many 
of which are unique to Vertex Al — enable users to explore and implement them effectively. 
This applies to both proprietary and open-source LLMs, allowing you to achieve superior 
results while optimizing for costs and latency requirements. 


- Prompt engineering“ leverages carefully crafted natural language prompts, potentially 
chained and enriched with external knowledge and examples, to nudge the LLM towards 
desired outputs without necessitating further training. Vertex Al through Vertex Al Studio 
offers a dedicated playground for crafting, testing, comparing and managing diverse 
prompts and techniques. Users can access various pre-built prompt templates within the 
platform and leverage public prompting guidelines® for Google’s proprietary large models. 
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- Supervised fine-tuning (SFT) on Vertex Al facilitates model adaptation by leveraging a 
set of labeled examples (even a few hundred is enough) to tune a model on specific tasks 
and contexts within domain-specific datasets. The required examples resemble the one- 
shot example structure employed in the construction of a prompt. This effectively extends 
the few-shot learning approach for enhanced optimization. This focused tuning enables 
the model to encode additional parameters in the model necessary for mimicking desired 
behaviors such as improved complex prompt comprehension, adaptation to specific 
output formats, correcting errors, and learning new tasks. The SFT tuning approach on 
Vertex Al, minimizes computational overhead and time while yielding an updated model 
that integrates the newly acquired parameters with the original model’s core parameters. 


- Reinforcement learning with human feedback (RLHF),“‘ available on Vertex Al for 
foundational models like PaLM 2,and open-source models like T5 (s-xxl) and Llama2, 
leverages human feedback to train large models to align with human preferences. This 
technique is well-suited in complex tasks involving preference modeling and optimizes 
LLMs on intricate, sequence-level objectives not easily addressed by traditional 
supervised fine-tuning. The process involves first training a reward model using a human 
preference dataset, then utilizing it to score the output from the LLM, and finally applying 
reinforcement learning to optimize the LLM. This approach is recognized as a key driver of 
success in conversational large language models. 


- Distillation step-by-step” is an advanced distillation technique transferring knowledge 
from a significantly larger model (known as teacher model) to a smaller task-specific 
model (known as student model), preserving important information while reducing model 
size. Step-by-Step Distillation?° surpasses common techniques by requiring significantly 
less data. This method, accessible on Vertex Al,°° significantly reduces inference costs and 
latencies while minimizing performance impact in the resulting smaller LLM.® 
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Orchestrate 


Any training or tuning job you run can be orchestrated and then operationalized using Vertex 
Pipelines,® a service that aims to simplify and automate the deployment, management, and 
scaling of your ML workflows. 


It provides a platform for building, orchestrating, scheduling and monitoring complex and 
custom ML pipelines, enabling you to efficiently translate your models from prototypes 
to production. 


Vertex Pipelines is also the platform behind all the managed tuning and evaluation services 
for the Google Foundation Models on Vertex Al. This ensures consistency as you can 
consume and extend those pipelines easily, without having to familiarize yourself with 
many services. 


Getting started with Vertex Pipelines is simple: you define the pipeline’s step sequence in 


a Python file utilizing Kubeflow SDK.” For further details and comprehensive onboarding, 
consult the official documentation. 
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Chain & Augment: Vertex Al Grounding, Extensions, and RAG 
building blocks 


Beyond training, tuning and adapting models and prompts directly, Vertex Al offers a 
comprehensive ecosystem for augmenting LLMs, to address the challenges of factual 
grounding and hallucination. The platform incorporates emerging techniques like RAG and 
agent-based approaches. 


RAG overcomes limitations by enriching prompts with data retrieved from vector databases, 
circumventing pre-training requirements and ensuring the integration of up-to-date 
information. Agent-based approaches, popularized by ReAct prompting, leverage LLMs as 
mediators interacting with tools like RAG systems, APls, and custom extensions. Vertex Al 
facilitates this dynamic information source selection, enabling complex queries, real-time 
actions, and the creation of multi-agent systems connected to vast information networks for 
sophisticated query processing and real-time decision-making. 


Vertex Al function calling’? empowers users by enhancing the capabilities of language 
models (LLMs). It enables LLMs to access real-time data and interact with external systems, 
providing users with more accurate and up-to-date information. To do that, users need to 
provide function definitions such as description, inputs, outputs to the gen Al model. Instead 
of directly executing functions, the LLM intelligently analyzes user requests and generates 
structured data outputs. These outputs propose which function to call and what arguments 
to use. 


Vertex Al Grounding? helps users connect large models with verifiable information by 
grounding them to internal data corpora on Vertex Al Agent Builder” or external sources 
using Google Search. This enables two key functionalities: verifying model-generated outputs 
against internal or external sources and creating RAG systems using Google’s advanced 
search capabilities that produce quality content grounded in your own or web search data. 
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Vertex Al extensions? let developers integrate Vertex Foundation Models with real-time 
data and real-world actions through APIs and functions, enabling task execution and allowing 
enhanced capabilities. This extends to leveraging 1st party extensions like Vertex Al Search’ 
and Code Interpreter,” or 3rd party extensions for triggering and completing transactions. 
Imagine building an application that leverages the LLM's knowledge to plan a trip and 
seamlessly utilizes internal APIs to book hotels and flights, all within a single interface. 
Additionally, Vertex Extensions facilitate function calling with the gemini-pro model, enabling 
you to generate descriptions, pass them to the large model, receive JSON with function 
arguments, and automatically call the function. 


Vertex Al Agent Builder” is an out-of-the-box solution that allows you to quickly build gen 
Al agents, to be used as conversational chatbots or as part of a search engine. With Vertex 
Al Agent Builder, you are be able to easily ground your agents by pointing to a diverse range 
of data sources, including structured datastores such us BigQuery, Spanner, Cloud SQL, 
unstructured sources like website content crawling and cloud storage as well as connectors 
to Google drive and other APls. Agent Builder utilizes a robust foundation of Google Search 
technologies, encompassing semantic search, content chunking, ranking, algorithms, 

and user intent understanding. Under the hood it optimizes document loading, chunking, 
embedding models, and ranking strategies. It abstracts away these complexities and allows 
users to simply specify their data source to initiate the gen Al-powered agent.This approach 
is ideal for organizations seeking to build robust search experiences for standard use cases 
without extensive technical expertise. 


Vector databases are specialized systems for managing multi-dimensional data. This data, 
encompassing images, text, audio, video, and other structured or unstructured formats, 

is represented as vectors capturing its semantic meaning. Vector databases accelerate 
searching and retrieval within these high-dimensional spaces, enabling efficient tasks like 
finding similar images from billions or extracting relevant text snippets based on various 
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inputs. For a deeper dive into these topics, refer to 4 and 19. Vertex Al offers three flexible 
solutions for storing and serving embeddings at scale, catering to diverse use cases and 
user profiles. 


Vertex Al Vector Search’ is a highly scalable low-latency similarity search and fully 
managed vector database scaling to billions of vector embeddings with auto-scaling. This 
technology, built upon ScaNN” (a Google-developed technology used in products like 
Search, YouTube, and Play), allows you to search from billions of semantically similar or 
related items within your stored data. In the context of gen Al, the most common use cases 
where Vertex Vector Search can be used are: 


1. Finding similar items (either text or image) based solely on their semantic meaning, in 
conjunction with an embedding model. 


2. Creating a hybrid search approach that combines semantic and keyword or metadata 
search to refine the results. 


3. Extracting relevant information from the database to feed into LLMs, enabling them to 
generate more accurate and informed responses. 


Vertex Al Vector Search primarily functions as a vector database for storing pre-generated 
embeddings. These embeddings must be created beforehand using separate models like 
Vertex Embedding models” (namely textembedding-gecko, text-embedding-gecko- 
multilingual, or multimodalembedding). Choosing Vertex Vector Search is optimal 
when you require control over aspects like the chunk, retrieval, query and models strategy. 
This includes fine-tuning an embedding model for your specific data. However, if your use 
case is a Standard one requiring little customization, a readily available solution like Vertex 
Search might be a better choice. 
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Vertex Al Feature Store” is a centralized and fully managed repository for ML features 
and embedding. It enables teams to share, serve, and reuse machine learning features and 
embeddings effortlessly alongside other data. Its native BigQuery” integration eliminates 
duplication, simplifies lineage tracking and preserves data governance. Vertex Al Feature 
Store supports offline retrieval and an easy and fast online serving for machine learning 
features and embeddings. Vertex Al Feature Store is a good choice when you want to iterate 
and maintain different embedding versions alongside other machine learning features in a 
single place. 


Vertex Al offers the flexibility to seamlessly create and connect various products to build 
your own custom grounding, RAG, and Agent systems. This includes utilizing diverse 
embedding models (multimodal, multilingual), various vector stores (Vector Search, Feature 
Store) and search engines like Vertex Al Agent Builder, extensions, grounding, and even SQL 
query generation for complex natural language queries. Moreover, Vertex Al provides SDK 
integration with LangChain’ to easily build and prototype applications using the umbrella 

of Vertex Al products. For further details and integration information, consult the official 
documentation” and official examples.” 


Evaluate: Vertex Al Experiments, Tensorboard, & 
evaluation pipelines 


In the dynamic world of gen Al, experimentation and evaluation are the cornerstones of 
iterative development and continuous improvement. With a multitude of variables influencing 
Gen Al models (prompt engineering, model selection, data interaction, pretraining, 

and tuning), evaluation goes hand-in-hand with experimentation. The more seamlessly 
experiments and evaluations can be integrated into the development process, the 
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smoother and more efficient the overall development becomes. Vertex Al provides cohesive 
experimentation and evaluation products permitting connected iterations over applications 
and models alongside their evaluations. 


Experiment 


The process of selecting, creating, and customizing machine learning (including large 
models) and its applications involves significant experimentation, collaboration, and iteration. 


Vertex Al seamlessly integrates experimentation and collaboration into the development 
lifecycle of AI/ML and gen Al models and applications. Its Workbench Instances” provide 
Jupyter-based development environments for the entire data science workflow, connected 
to other Google Cloud services and with GitHub synchronization capabilities. Vertex Colab 
Enterprise” accelerates the Al workflow by enabling collaborative coding and leveraging 
code completion and generation features. 


Vertex Al also provides two tools for tracking and visualizing the output of many experiment 
cycles and training runs. Vertex Al Experiments” facilitates meticulous tracking and 
analysis of model architectures, hyperparameters, and training environments. It logs 
experiments, artifacts, and metrics, enabling comparison and reproducibility across multiple 
runs. This comprehensive tracking permits data scientists to select the optimal model 

and architecture for their specific use case. Vertex Al TensorBoard®° complements the 
experimentation process by providing detailed visualizations for tracking, visualizing, and 
sharing ML experiments. It offers a range of visualizations, including loss and accuracy 
metrics tracking, model computational graph visualization, and weight and bias histograms, 
which - for example - can be used for tracking various metrics pertaining to training and 
evaluation of gen Al models with different prompting and tuning strategies. It also projects 
embeddings to lower-dimensional space, and displays image, text, and audio samples. 
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Evaluation 


Vertex Al also provides a comprehensive set of evaluation tools for gen Al, from ground truth 
metrics to using LLMs as raters. 


For Ground Truth-based metrics, Automatic Metrics in Vertex Al®' lets you evaluate a model 
based on a defined task and “ground truth” dataset. For LLM-based evaluation, Automatic 
Side by Side (Auto SxS) in Vertex Al®* uses a large model to evaluate the output of multiple 
models or configurations being tested, helping to augment human evaluation at scale. 


In addition to that, users can also leverage Rapid Evaluation API, which offers a set of pre- 
built metrics for evaluating gen Al applications and relative SDK, integrated into the Vertex 
Al Python SDK for rapid and flexible, notebook-based, prototyping. To get started with Rapid 
Evaluation Vertex Al SDK see example in the official documentation.® 


Predict: Vertex Al endpoints & monitoring 


Once developed, a production gen Al application must be deployed, including all its model 
components. If the application uses any models that have been trained or adapted, those 
models need to be deployed to their own serving endpoints. You can serve any model in the 
Model Garden through Vertex Al Endpoints*,which acts as the gateway for deploying your 
trained machine learning models. They allow you to serve online predictions with low latency, 
manage access controls, and monitor model performance easily through Model Monitoring. 
Endpoints also offer scaling options to handle varying traffic demands, ensuring optimal user 
experience and reliability. 
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Along with the prediction service, Vertex Al offers the following features for all Google 
managed models: 


+ Citation checkers: Gen Al on Vertex performs Citation checks”. Citations are important 
for LLMs and gen Al for several reasons. Citing sources ensures proper acknowledgment 
of sources and prevents plagiarism and demonstrates transparency and accountability. 
Citing sources is essential for LLMs and gen Al also because they help identify, 
understand potential biases, and enable reproducibility and verification. For example in 
Google Cloud,* the gen Al models are designed to produce original content, limiting the 
possibility of copying existing contents. If this happens, Google Cloud provides quotes for 
websites and code repositories. 


- Safety scores: Safety attributes are crucial for LLMs and gen Al to mitigate potential 
risks like bias, lack of explainability, and misuse. These attributes help detect and mitigate 
biased outputs and mitigate misuse, enabling these tools to be used responsibly. As 
LLMs and gen Al evolve, incorporating safety attributes will be increasingly essential for 
responsible and ethical use. For example, Google Cloud added safety scores in Vertex 
Al PaLM API and Vertex Al Gemini API®°: content processed through the API is checked 
against a list of safety attributes, including "harmful categories" and sensitive topics. Each 
attribute has a confidence score between 0.0 and 1.0, indicating the likelihood of the 
input belonging to that category. These safety filters can be used in conjunction with all 
models: be it proprietary ones like Palm2 and Gemini or OSS ones like the ones available in 
Model garden. 


+ Watermarking: With Al-based tools becoming increasingly popular for creation of 
content, it’s very important to identify if an image has been created using Al. Vertex Al 
offers digital watermarking and verification for Al-generated images®¢ using the algorithm 
SynthID®’ developed by Google DeepMind. 
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+ Content moderation and bias detection: By using the Content moderation® and Bias®? 
detection tools on Vertex Al, you can add an extra layer of security on the responses 
of the LLMs to mitigate the risk that the model training and tuning may sway a model to 
generate outputs that aren’t fair or appropriate for the task. 


Govern: Vertex Al Feature Store, Model Registry, 
and Dataplex 


Addressing the multifaceted requirements of data and model lineage and governance in 
gen Al requires a comprehensive strategy that tackles both conventional challenges and 
novel regulatory or technical complexities associated with large models. By adopting robust 
governance, observability, and lineage practices in the development of gen Al solutions, 
organizations can ensure comprehensive tracking, iteration, and evolution of data. They 

can also track the large models used, prompt adaptations, tuning, and other artifacts. This 
facilitates reproducibility of results, transparency and understanding of generated content 
sources, troubleshooting, compliance enforcement, and enhanced reliability and security. 
These practices collectively enable the ethical and responsible development and deployment 
of gen Al solutions. This fosters internal and external trust and fairness in gen Al models and 
practices. Vertex Al and Google Cloud offer the following comprehensive suite of tools for 
unified lineage, governance and monitoring, effectively addressing these critical concerns. 


In the context of governance and lineage, Vertex Al Feature Store” offers: 


* Track feature and embeddings versions and lineage, ensuring transparency 


* Monitor feature (prompt) and embedding, response drift, and identify potential 
issues proactively 


+ Store feature formulas and discover relevant features or embeddings for different 
use cases 
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- Utilize feature selection algorithms to optimize model performance 


* Consolidate and unify all machine learning data within a singular repository encompassing 
numerical data, categorical data, textual data, and embeddings representations 


Vertex Al Model Registry” serves as a centralized repository for comprehensive lifecycle 
management of both Google proprietary foundational and open-source Machine Learning 
models. This includes gen Al models in addition to predictive models. This unified platform 
enables registration, storage, and version control of diverse model types, including various 
iterations of tuning for large models. Vertex Al Model Registry seamlessly integrates with 
Vertex Pipelines,” facilitating orchestration and management of training and tuning jobs 
while leveraging lineage capabilities for recording and documenting the lineage from 
datasets to models and associated artifacts. It also couples with Vertex Al Experiments” 
and Vertex Al Model Evaluation,”° enabling performance monitoring and comparison of 
different model versions alongside their artifacts — all within a single interface. Furthermore, 
Vertex Al Model Registry bolsters observability by providing integrated configuration and 
access to Vertex Al Model Monitoring” and logging functionalities. This enables proactive 
identification and mitigation of both training-serving skew and prediction drift, ensuring 
reliability and accuracy of deployed models. Users can directly assign desired model versions 
to endpoints for one-click deployment from Vertex Model Registry or leverage aliases for 
simplified deployment. 


Google Cloud Dataplex" provides an organization-wide lineage across product boundaries 
in Google Cloud. Within the domains of Al and gen Al (and more broadly across data analytics 
and Al/ML) Dataplex seamlessly integrates with BigQuery and Vertex Al. Dataplex facilitates 
the unification, management, discovery, and governance of both data and models. Through 
comprehensive data lineage, quality, and metadata management capabilities it provides 
actionable insights for comprehensive data and model understanding. This promotes 
compliance, facilitates data analysis, and guarantees the training of machine learning 

models on trusted data sources. This in turn leads to enhanced accuracy and reliability. This 
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integration permits users across an organization to identify ‘champion models’ and ‘golden 
datasets and features’ across projects and regions in a secure way by adhering to identity 
access management (IAM)*? boundaries. In short, Dataplex encapsulates a framework within 
an organization that governs the interaction between people, processes and technology 
across all the products in Google Cloud. 


Conclusion 


The explosion of gen Al in the last several years introduced fundamental changes in the way 
Al applications are developed — but far from upending the MLOps discipline, these changes 
have only reinforced its basic principles and processes. As we have seen, the principles of 
MLOps that emphasize reliability, repeatability, and dependability in ML systems development 
are comfortably extended to include the innovations of gen Al. Some of the necessary 
changes are deeper and more far-reaching than others, but nowhere do we find any change 
that MLOps cannot accommodate. 


As a result, many tools and processes built to support traditional MLOps can also support 
the requirements of gen Al. Vertex Al, for instance, is a powerful platform that can be used to 
build and deploy machine learning models and Al applications. It provides a comprehensive 
suite of functions for developing both Predictive and gen Al systems, encompassing data 
preparation, pre-trained APls, AUtoML capabilities, training and serving hardware, advanced 
fine-tuning techniques and deployment tools, and a diverse selection of proprietary and 
open-source foundation models. It also offers evaluation methods, monitoring capabilities, 
and governance tools, all unified within a single platform to streamline the Al development 
lifecycle. It’s built on Google Cloud Platform, which provides a scalable, reliable, secure and 
compliant infrastructure for machine learning. It’s a good choice for organizations that want 
to build and deploy machine learning models and Al applications. 
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The next few years will undoubtedly see gen Al extended in directions that today are 
unimaginable. Regardless of the direction these developments take, it will continue to 
be important to build on solid engineering processes that embody the basic principles 
of MLOps. These principles support the development of scalable, robust production Al 
applications today, and no doubt will continue to do so into the future. 
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